Abstract

We propose a new method of learning a sparse nonnegative-definite target matrix.
Our primary example of the target matrix is the inverse of a population covariance
matrix or correlation matrix. The algorithm first estimates each column of the matrix
by the scaled Lasso, a joint estimation of regression coefficients and noise level,
and then adjusts the matrix estimator to be symmetric. The procedure is efficient in
the sense that the penalty level of the scaled Lasso for each column is completely
determined by the data via convex minimization, without using cross-validation. We
prove that this method guarantees the fastest proven rate of convergence in the
spectrum norm under conditions of weaker form than those in the existing analyses of
other $\ell_1$ algorithms, and has a faster guaranteed rate of convergence when the
ratio of the $\ell_1$ and spectrum norms of the target inverse matrix diverges to
infinity. A simulation study also demonstrates the competitive performance of the
proposed estimator.

1 Introduction 

We consider the estimation of the matrix inversion $\Theta^*$ satisfying
$\bar\Sigma\Theta^* \approx I$, given a data matrix $\bar\Sigma$. When $\bar\Sigma$ is a
sample covariance matrix, our problem is the estimation of the inverse of the
corresponding population covariance matrix. The inverse covariance matrix is also
called the precision matrix or concentration matrix. With the dramatic advances in
technology, the number of covariates $p$ is of greater order than the sample size $n$
in many statistical and engineering applications. In this case, the sample covariance
matrix is always singular and thus it is difficult to compute the precision matrix.
In such cases, a certain type of sparsity condition is required for proper estimation
of the precision matrix and for theoretical investigation of the estimation problem.
In this paper, we impose for simplicity an $\ell_0$ (maximum degree) sparsity
condition on the target inverse matrix $\Theta^*$.

Many approaches have been proposed to estimate the sparse inverse matrix in the high
dimensional setting. The $\ell_1$ penalization is one of the most popular methods.
Lasso-type methods, or convex minimization algorithms with the $\ell_1$ penalty on all
entries of $\Theta^*$, have been discussed by Banerjee, El Ghaoui and d'Aspremont
(2008), Friedman, Hastie and Tibshirani (2008) and others, and by Yuan and Lin (2007)
with $\ell_1$ penalization on the off-diagonal entries only. This is referred to as the
graphical Lasso (GLasso) due to the connection of the precision matrix to Gaussian
Markov graphical models. In this GLasso framework, Rothman, Bickel, Levina and Zhu
(2008) proved the convergence rate $\{((p+s)/n)\log p\}^{1/2}$ in the Frobenius norm
and $\{(s/n)\log p\}^{1/2}$ in the spectrum norm, where $s$ is the number of nonzero
off-diagonal entries. Ravikumar, Wainwright, Raskutti and Yu (2008) provided
sufficient conditions for model selection consistency of this $\ell_1$-regularized MLE.
Lam and Fan (2009) studied a general penalty function and achieved a sharper bound of
order $\{(s/n)\log p\}^{1/2}$ under the Frobenius norm for the $\ell_1$ penalty. Since
the spectrum norm can be controlled via the Frobenius norm, this provides a sufficient
condition $(s/n)\log p \to 0$ for the convergence under the spectrum norm to the
unknown precision matrix. This is a very strong condition since $s$ is of the order
$dp$ for banded precision matrices, where $d$ is the matrix degree, i.e. the largest
number of nonzero entries in the columns.

Some recent work suggests a weaker sufficient condition in terms of the matrix degree.
Yuan (2010) estimated each column of the inverse matrix by the Dantzig selector and
then sought a symmetric matrix close to the column estimation. When the $\ell_1$ norm
of the precision matrix is bounded, this method achieves a convergence rate of order
$d\{(\log p)/n\}^{1/2}$ under several matrix norms. The CLIME estimator, introduced by
Cai, Liu and Luo (2011), has the same order of convergence rate; it uses the plug-in
method with the Dantzig selector to estimate each column, followed by a simpler
symmetrization step. They also require the boundedness of the $\ell_1$ norm of the
unknown. In Yang and Kolaczyk (2010), the Lasso is applied to estimate the columns of
the target matrix under the assumption of equal diagonal, and the estimation error is
studied in the Frobenius norm for $p = n^{\nu}$. This column-by-column idea reduces a
graphical model to a regression model. It was first introduced in Meinshausen and
Bühlmann (2006) for identifying nonzero variables in a graphical model, called
neighborhood selection.

In this paper, we propose to apply the scaled Lasso (Sun and Zhang, 2011) column- 
by-column to estimate a precision matrix in the high dimensional setting. Based on 
the connection of precision matrix to linear regression by the block inversion formula, 
we construct a column estimator with the scaled Lasso, a joint estimator for the 
regression coefficients and noise level. Since we only need the sample covariance 
matrix in our procedure, this estimator could be extended to generate an approximate 
inverse of a nonnegative-definite data matrix in a general setting. This scaled Lasso algorithm
provides a fully specified map from the space of nonnegative-definite matrices to the 
space of symmetric matrices. For each column, the penalty level of the scaled Lasso 
is determined by data via convex minimization, without using cross-validation. 

We study theoretical properties of the proposed estimator for a precision matrix
under a normality assumption. More precisely, we assume that the data matrix is the
sample covariance matrix $\bar\Sigma = X'X/n$, where the rows of $X$ are iid
$N(0, \Sigma^*)$. Under conditions on the spectrum norm and degree of the inverse of
$\Sigma^*$, we prove that the proposed estimator guarantees the rate of convergence of
order $d\{(\log p)/n\}^{1/2}$ in the spectrum norm. The conditions are weaker than
those in the existing analyses of other $\ell_1$ algorithms, which typically require
the boundedness of the $\ell_1$ norm. When the $\ell_1$ norm of the target matrix
diverges to infinity, the analysis of the proposed estimator guarantees a faster
convergence rate than that of the existing literature. We state this main result of
the paper in the following theorem.

Theorem 1 Let $\hat\Theta$ be the scaled Lasso estimator, defined in (5), (6) and (7)
below, based on $n$ iid observations from $N(0, \Sigma^*)$. Let $\rho_*$ and $\rho^*$ be
the smallest and largest eigenvalues of the correlation matrix of $\Sigma^*$,
$\Theta^*$ be the inverse of $\Sigma^*$ and $d = \max_j \#\{i : \Theta^*_{ij} \ne 0\}$
be the maximum degree of $\Theta^*$. Suppose that $d\sqrt{(\log p)/n} \to 0$, the
diagonal entries of the target matrix $\Theta^*$ are uniformly bounded, $\rho_*$ is
bounded away from zero and $(\rho^*/\rho_*)\{(d/n)\log p\}^{1/2} \le a$ for a small
fixed $a$. Then, the spectrum norm of the estimation error $\hat\Theta - \Theta^*$ is
bounded by

$$\|\hat\Theta - \Theta^*\|_2 = O_P\big(d\sqrt{(\log p)/n}\big).$$

The convergence of the proposed scaled Lasso estimator under the sharper spectrum
norm condition on $\Theta^*$, instead of the stronger bounded $\ell_1$ condition, is not
entirely technical. It is a direct consequence of the faster convergence rate of the
scaled Lasso estimator of the noise level in linear regression. To the best of our
knowledge, it is unclear if other $\ell_1$ algorithms also achieve this fast convergence rate,
either for the estimation of the noise level in linear regression or for the estimation 
of a precision matrix under the spectrum norm. However, it is still possible that this 
difference between the scaled Lasso and other methods is due to potentially coarser 
specification of the penalty level in other algorithms (e.g. cross validation) or a less 
accurate error bound in other analyses. 

The rest of the paper is organized as follows. In Section 2, we present the estima- 
tion for the inversion of a nonnegative definite matrix via the scaled Lasso. In Section 
3, we study error bounds of the proposed estimator for precision matrix. Simulation 
studies are presented in Section 4. In Section 5, we discuss oracle inequalities for the 
scaled Lasso with unnormalized predictors and the estimation of inverse correlation 
matrix. Section 6 includes all the proofs. 

We use the following notation throughout the paper. For a vector
$v = (v_1, \ldots, v_p)$, $\|v\|_q = (\sum_j |v_j|^q)^{1/q}$ is the $\ell_q$ norm with
the special $\|v\| = \|v\|_2$ and the usual extensions $\|v\|_\infty = \max_j |v_j|$
and $\|v\|_0 = \#\{j : v_j \ne 0\}$. For matrices $M$, $M_{*,j}$ is the $j$-th column of
$M$, $M_{A,B}$ represents the submatrix of $M$ with rows in $A$ and columns in $B$, and
$\|M\|_q = \sup_{\|v\|_q = 1}\|Mv\|_q$ is the $\ell_q$ matrix norm. In particular,
$\|\cdot\|_2$ is the spectrum norm for symmetric matrices. Moreover, we denote the set
$\{j\}$ by $j$ and denote the set $\{1, \ldots, p\} \setminus \{j\}$ by $-j$ in the
subscripts.

2 Matrix inversion via scaled Lasso



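Let $\bar\Sigma$ be a nonnegative-definite data matrix and $\Theta^*$ be a
positive-definite target matrix with $\bar\Sigma\Theta^* \approx I$. In this section, we
describe the relationship between positive-definite matrix inversion and linear
regression and propose an estimator for $\Theta^*$ via the scaled Lasso, a joint convex
minimization for the estimation of regression coefficients and noise level.

We use the scaled Lasso to estimate $\Theta^*$ column by column. Define $\sigma_j > 0$
and $\beta \in R^{p \times p}$ by

$$\sigma_j^2 = (\Theta^*_{jj})^{-1}, \qquad \beta_{*,j} = -\Theta^*_{*,j}\sigma_j^2 = -\Theta^*_{*,j}(\Theta^*_{jj})^{-1}. \qquad (1)$$

In the matrix form, we have the following relationship

$$\mathrm{diag}\,\Theta^* = \mathrm{diag}(\sigma_j^{-2},\ j = 1, \ldots, p), \qquad \Theta^* = -\beta(\mathrm{diag}\,\Theta^*). \qquad (2)$$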
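The relationship in (1) and (2) can be checked numerically. The following is a minimal sketch, not part of the original paper: the 3 x 3 covariance matrix and the function name `regression_from_precision` are made up for illustration. It compares the column of $\beta$ read off from the precision matrix with the population regression coefficients and residual variance computed directly from the covariance matrix.

```python
import numpy as np

def regression_from_precision(theta, j):
    """Column j of beta and sigma_j^2 as defined in (1).

    beta_col[j] is -1; the other entries are the population regression
    coefficients of variable j on the remaining variables.
    """
    beta_col = -theta[:, j] / theta[j, j]
    sigma2 = 1.0 / theta[j, j]
    return beta_col, sigma2

# Hypothetical covariance matrix, used only for illustration.
sigma_star = np.array([[2.0, 0.6, 0.2],
                       [0.6, 1.0, 0.3],
                       [0.2, 0.3, 1.5]])
theta_star = np.linalg.inv(sigma_star)

j, others = 0, [1, 2]
beta_col, sigma2 = regression_from_precision(theta_star, j)

# The same quantities computed directly from the covariance matrix.
coef = np.linalg.solve(sigma_star[np.ix_(others, others)], sigma_star[others, j])
resid_var = sigma_star[j, j] - sigma_star[j, others] @ coef

print(np.allclose(beta_col[others], coef), np.isclose(sigma2, resid_var))
```

Reading the regression quantities off a single column of $\Theta^*$ is what allows the matrix to be estimated one column at a time.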

Let $\Sigma^* = (\Theta^*)^{-1}$. Since
$(\partial/\partial b_{-j})\, b'\Sigma^* b = 2\Sigma^*_{-j,*} b = 0$ at $b = \beta_{*,j}$,
one may estimate the $j$-th column of $\beta$ by minimizing the $\ell_1$ penalized
quadratic loss. In order to shrink the estimated coefficients on the same scale, we
adjust the penalty function with a normalizing factor, which leads to the $\ell_1$
penalized quadratic loss as follows,

$$b'\bar\Sigma b/2 + \lambda\sum_{k=1}^p \bar\Sigma_{kk}^{1/2}|b_k|$$

subject to $b_j = -1$. This is actually the Lasso for a linear regression model with
normalized predictors. In practice, we first normalize the predictors by the weights
$\bar\Sigma_{kk}^{1/2}$ ($k \ne j$); then the minimization problem can be solved by
algorithms for the Lasso estimation. This is similar to Yuan (2010) and Cai, Liu and
Luo (2011), who used the Dantzig selector to estimate each column. However, one still
needs to choose a penalty level $\lambda$ and to estimate $\sigma_j$ to recover
$\Theta^*$ via (2). A solution to resolve these two issues is the scaled Lasso (Sun and
Zhang, 2011):

$$\{\hat\beta_{*,j}, \hat\sigma_j\} = \arg\min_{b,\sigma}\Big\{\frac{b'\bar\Sigma b}{2\sigma} + \frac{\sigma}{2} + \lambda_0\sum_{k=1}^p \bar\Sigma_{kk}^{1/2}|b_k| : b_j = -1\Big\} \qquad (3)$$

where $\lambda_0 = A\sqrt{2\log(p^2/\epsilon)/n}$ with a fixed $A > 1$. The scaled Lasso
(3) is a solution of joint convex minimization over $\{b, \sigma\}$ due to the
convexity in $(b, \sigma)$ (Huber, 2009; Antoniadis, 2010). Since
$\beta'\Sigma^*\beta = (\mathrm{diag}\,\Theta^*)^{-1}\Theta^*(\mathrm{diag}\,\Theta^*)^{-1}$,

$$\mathrm{diag}(\beta'\Sigma^*\beta) = (\mathrm{diag}\,\Theta^*)^{-1} = \mathrm{diag}(\sigma_j^2,\ j = 1, \ldots, p).$$

Thus, (3) is expected to yield consistent estimates of $\sigma_j$.

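Sun and Zhang (2011) provided an iterative algorithm to compute the scaled Lasso
estimator (3). We rewrite the algorithm in the form of matrices. For each
$j \in \{1, \ldots, p\}$, the Lasso path is given by the estimates
$\hat\beta_{-j,j}(\lambda)$ satisfying the following KKT conditions, for all $k \ne j$,

$$\bar\Sigma_{kk}^{-1/2}\bar\Sigma_{k,*}\hat\beta_{*,j}(\lambda) = -\lambda\,\mathrm{sgn}(\hat\beta_{k,j}(\lambda)), \qquad \hat\beta_{k,j}(\lambda) \ne 0,$$
$$\bar\Sigma_{kk}^{-1/2}\bar\Sigma_{k,*}\hat\beta_{*,j}(\lambda) \in \lambda[-1, 1], \qquad \hat\beta_{k,j}(\lambda) = 0, \qquad (4)$$

where $\hat\beta_{j,j}(\lambda) = -1$. Based on the Lasso path
$\hat\beta_{*,j}(\lambda)$, the scaled Lasso estimator $\{\hat\beta_{*,j}, \hat\sigma_j\}$
is computed iteratively by

$$\hat\sigma_j^2 \leftarrow \hat\beta_{*,j}'\bar\Sigma\hat\beta_{*,j}, \qquad \lambda \leftarrow \hat\sigma_j\lambda_0, \qquad \hat\beta_{*,j} \leftarrow \hat\beta_{*,j}(\lambda). \qquad (5)$$

Here the penalty level of the Lasso is determined by the data without using
cross-validation. We then simply take advantage of the relationship (2) and compute
the coefficients and noise levels by the scaled Lasso for each column

$$\mathrm{diag}\,\tilde\Theta = \mathrm{diag}(\hat\sigma_j^{-2},\ j = 1, \ldots, p), \qquad \tilde\Theta = -\hat\beta(\mathrm{diag}\,\tilde\Theta). \qquad (6)$$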
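The column-by-column procedure in (5) and (6) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: coordinate descent is swapped in as the Lasso solver (the text only requires some algorithm for the Lasso), the function names `lasso_column` and `scaled_lasso_precision` are hypothetical, and the sketch assumes $b'\bar\Sigma b > 0$ at every iterate so that the noise level stays positive.

```python
import numpy as np

def lasso_column(S, j, lam, b_init, n_sweeps=200, tol=1e-10):
    """Coordinate descent for min_b b'Sb/2 + lam * sum_k sqrt(S_kk)|b_k| with b_j = -1."""
    p = S.shape[0]
    b = b_init.copy()
    b[j] = -1.0
    for _ in range(n_sweeps):
        max_change = 0.0
        for k in range(p):
            if k == j or S[k, k] <= 0:
                continue
            # Gradient of the quadratic part with coordinate k removed.
            r = S[k] @ b - S[k, k] * b[k]
            thresh = lam * np.sqrt(S[k, k])
            # Soft-thresholding update that satisfies the KKT conditions (4).
            new = -np.sign(r) * max(abs(r) - thresh, 0.0) / S[k, k]
            max_change = max(max_change, abs(new - b[k]))
            b[k] = new
        if max_change < tol:
            break
    return b

def scaled_lasso_precision(S, lam0, n_iter=50, tol=1e-8):
    """Preliminary (not yet symmetrized) estimate built from (5) and (6)."""
    p = S.shape[0]
    theta = np.zeros((p, p))
    for j in range(p):
        b = np.zeros(p)
        b[j] = -1.0
        sigma = np.sqrt(S[j, j])          # sqrt(b'Sb) at the starting point
        for _ in range(n_iter):
            b = lasso_column(S, j, sigma * lam0, b)   # lambda <- sigma * lam0
            sigma_new = np.sqrt(b @ S @ b)            # sigma^2 <- b'Sb
            converged = abs(sigma_new - sigma) < tol
            sigma = sigma_new
            if converged:
                break
        theta[:, j] = -b / sigma ** 2     # column j of (6)
    return theta

# Illustrative data: rows iid N(0, I), so the target precision matrix is I.
rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))
S = X.T @ X / n
theta_tilde = scaled_lasso_precision(S, lam0=np.sqrt(np.log(p) / n))
```

With `lam0 = 0` and a positive-definite input, the sketch reduces to exact matrix inversion, which gives a simple sanity check of the update rules.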

It is noticed that a good estimator for $\Theta^*$ should be a symmetric matrix.
However, the estimator $\tilde\Theta$ does not have to be symmetric. We improve this
estimator by using a symmetrization step as in Yuan (2010),

$$\hat\Theta = \arg\min_{M : M^T = M}\|M - \tilde\Theta\|_1, \qquad (7)$$

which can be solved by linear programming. Alternatively, semidefinite programming,
which is somewhat more expensive computationally, can be used to produce a nonnegative
definite $\hat\Theta$ in (7). According to the definition, the new estimator
$\hat\Theta$ has the same $\ell_1$ error rate as $\tilde\Theta$. A nice property for
symmetric matrices is that the spectrum norm is bounded by the $\ell_1$ matrix norm.
The $\ell_1$ matrix norm can be given more explicitly as the maximum $\ell_1$ norm of
the columns, while the $\ell_\infty$ matrix norm is the maximum $\ell_1$ norm of the
rows. Hence, for any symmetric matrix, the $\ell_1$ matrix norm is equal to the
$\ell_\infty$ matrix norm, so the spectrum norm can be bounded by either of them. Since
both our estimator and the target matrix are symmetric, the error bound based on the
spectrum norm could be studied by bounding the $\ell_1$ error, as typically done in the
existing literature. We will discuss these error bounds in Section 3.
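The norm relations used above are easy to verify numerically. The sketch below is only an illustration: the helper names and the two small matrices are assumptions, not taken from the paper. It computes the $\ell_1$ and $\ell_\infty$ matrix norms as maximum column and row sums and compares them with the spectrum norm.

```python
import numpy as np

def matrix_l1_norm(M):
    """l1 matrix norm: maximum l1 norm of the columns."""
    return np.abs(M).sum(axis=0).max()

def matrix_linf_norm(M):
    """l_infinity matrix norm: maximum l1 norm of the rows."""
    return np.abs(M).sum(axis=1).max()

def spectrum_norm(M):
    """Spectrum norm: largest singular value."""
    return np.linalg.norm(M, 2)

# Symmetric example: the l1 and l_infinity norms coincide and bound the spectrum norm.
M = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])

# Non-symmetric example: only the weaker bound ||A||_2^2 <= ||A||_1 ||A||_inf applies.
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])

print(matrix_l1_norm(M), matrix_linf_norm(M), spectrum_norm(M))
print(spectrum_norm(A) ** 2 <= matrix_l1_norm(A) * matrix_linf_norm(A))
```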

To sum up, we propose to estimate the matrix inversion by (5), (6) and (7). The
iterative algorithm (5) computes the regression coefficients and noise level based on a
Lasso path determined by (4). Then (6) translates the resulting estimators of (5) to
column estimators and thus a preliminary matrix estimator is constructed. Finally,
the symmetrization step (7) produces a symmetric estimate for our target matrix.

3 Error bounds for precision matrix



In this section, we study the error $\hat\Theta - \Theta^*$ for the inverse of a
covariance matrix, which is our primary example of the target matrix. From now on, we
suppose that the data matrix is the sample covariance matrix $\bar\Sigma = X'X/n$,
where the rows of $X$ are iid $N(0, \Sigma^*)$, and the target matrix is
$\Theta^* = (\Sigma^*)^{-1}$.

Let $\rho_*$ and $\rho^*$ be the smallest and the largest eigenvalues of the
correlation matrix $(\mathrm{diag}\,\Sigma^*)^{-1/2}\Sigma^*(\mathrm{diag}\,\Sigma^*)^{-1/2}$.
Define $S_j = \{i \ne j : \Theta^*_{ij} \ne 0\}$ and the degree of the matrix

$$d = \deg(\Theta^*) = \max_j |S_j| + 1.$$

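The following theorem gives the convergence rate based on the $\ell_1$ matrix norm
($\ell_\infty$ matrix norm) and spectrum norm.

Theorem 2 Let $\epsilon \in (0, 1/4)$ and
$\lambda_0 = A\{(2/n)\log(p^2/\epsilon)\}^{1/2}$ with $A > 1$. Suppose that
$\{d(\log p)/n\}^{1/2}\rho^*/\rho_* \le a$ for a small fixed $a$. Then with probability
greater than $1 - 4\epsilon$,

$$\|\hat\Theta - \Theta^*\|_2 \le \|\hat\Theta - \Theta^*\|_1 = \|\hat\Theta - \Theta^*\|_\infty$$
$$\le C_1\lambda_0^2 d\,\|\Theta^*\|_1\rho_*^{-1} + C_2\lambda_0 d\max_k\Theta^*_{kk}\,\rho_*^{-1}, \qquad (8)$$

where $C_1$ and $C_2$ are constants depending on $\{A, a\}$ only.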

Since the entries of $\Theta^*$ are bounded by the maximum of the diagonal, the
$\ell_1$ matrix norm $\|\Theta^*\|_1$ is of the same order as the matrix degree $d$.
Thus, the inequality (8) provides a convergence rate of the order $d\lambda_0$ for
either the $\ell_1$ matrix norm or the spectrum norm under the conditions
$d\{(\log p)/n\}^{1/2} \to 0$, $\rho_*^{-1} = O(1)$ and $\max_k \Theta^*_{kk} = O(1)$.
The first condition is the main sparsity condition, and the other two are actually
conditions on the $\ell_2$ norm of the target matrix. To achieve the same convergence
rate, Yuan (2010) and Cai, Liu and Luo (2011) both imposed the condition
$d\{(\log p)/n\}^{1/2} \to 0$ and the boundedness of the $\ell_1$ norm of the unknown.
We replace the $\ell_1$ condition by the weaker boundedness of the spectrum norm of the
unknown. The spectrum norm condition on the unknown is not only weaker, but also
natural for the convergence in spectrum norm. The extra condition
$\{d(\log p)/n\}^{1/2}\rho^*/\rho_* \le a$ here is not strong. Under the conditions
$d\{(\log p)/n\}^{1/2} \to 0$ and $\rho_*^{-1} = O(1)$, the extra condition only
requires $\rho^*/d^{1/2}$ to be small and it allows $\rho^*$ to diverge to infinity.

This sharper error bound in the spectrum norm is a consequence of using the scaled
Lasso estimator. Sun and Zhang (2011) gave a convergence rate of order
$\lambda_0^2 d$ for the scaled Lasso estimation of the noise levels $\sigma_j$. With
this faster rate of convergence, the estimation error in the diagonal is no longer the
main term and thus the condition of the bounded $\ell_1$ norm of $\Theta^*$ can be
weakened.

The consistency of the scaled Lasso estimation for the noise level is based on the
$\ell_1$ error bound for the regression coefficients. Oracle inequalities for the
$\ell_1$ error of the Lasso have been studied with various conditions, including the
restricted isometry condition (Candes and Tao, 2007), the compatibility condition (van
de Geer, 2007) and the sign-restricted cone invertibility factor (Ye and Zhang, 2010)
among others. Sun and Zhang (2011) extended these oracle inequalities to the scaled
Lasso. Here we use the version under the condition of the $\ell_1$ sign-restricted cone
invertibility factor (SCIF)

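$$\mathrm{SCIF}_1(\xi, S; \Sigma) = \inf\{|S|\,\|\Sigma u\|_\infty/\|u\|_1 : u \in \mathscr{C}_-(\xi, S)\} > 0, \qquad (9)$$

with the cone $\mathscr{C}(\xi, S) = \{u \in R^p : \|u_{S^c}\|_1 \le \xi\|u_S\|_1\}$ and the sign-restricted cone
$\mathscr{C}_-(\xi, S) = \{u \in \mathscr{C}(\xi, S) : u_j\Sigma_{j,*}u \le 0,\ \forall j \notin S\}$. It is proved that, conditional on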

= 0,{l)\S,\Xl - = Op(l)|5,|Ao, (10) 

under the condition that $\mathrm{SCIF}_1(\xi, S_j; \bar\Sigma_{-j})$ is bounded away
from 0. This is guaranteed by the conditions of Theorem 2. The error bound of the
$\ell_1$ matrix norm then follows from (10).

4 Numerical study 

In this section, we compare the proposed matrix estimator based on scaled Lasso with 
graphical Lasso and CLIME (Cai, Liu and Luo, 2011). Three models are considered. 
The first two models are the same as model 1 and model 2 in Cai, Liu and Luo (2011). 
Model 2 was also studied in Rothman et al. (2008). 

• Model 1: $\Theta_{ij} = 0.6^{|i-j|}$.

• Model 2: Let $\Theta = B + \delta I$, where each off-diagonal entry in $B$ is generated
independently and equals 0.5 with probability 0.1 or 0 with probability 0.9.
$\delta$ is chosen such that the condition number of $\Theta^*$ is $p$. Finally, we
rescale the matrix $\Theta^*$ to have unit diagonal.

• Model 3: The diagonal of the target matrix has unequal values.
$\Theta = D^{1/2}\Omega D^{1/2}$, where $\Omega_{ij} = 0.6^{|i-j|}$ and $D$ is a diagonal
matrix with diagonal elements $d_{ii} = (4i + p - 5)/\{5(p - 1)\}$.
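The three target matrices can be generated along the following lines. This is a sketch under assumptions the text does not settle: in Model 2, $B$ is taken to be symmetric with zero diagonal and only its upper triangle is drawn at random; the function names `model1`, `model2` and `model3` are hypothetical.

```python
import numpy as np

def model1(p):
    """Model 1: Theta_ij = 0.6^{|i-j|}."""
    idx = np.arange(p)
    return 0.6 ** np.abs(idx[:, None] - idx[None, :])

def model2(p, rng):
    """Model 2: Theta = B + delta*I, rescaled to unit diagonal.

    Assumption: B is symmetric with zero diagonal.
    """
    upper = np.triu(rng.random((p, p)) < 0.1, k=1) * 0.5
    B = upper + upper.T
    eig = np.linalg.eigvalsh(B)          # ascending eigenvalues
    # Solve (eig_max + delta) / (eig_min + delta) = p for delta.
    delta = (eig[-1] - p * eig[0]) / (p - 1)
    theta = B + delta * np.eye(p)
    scale = 1.0 / np.sqrt(np.diag(theta))
    return theta * scale[:, None] * scale[None, :]

def model3(p):
    """Model 3: Theta = D^{1/2} Omega D^{1/2} with d_ii = (4i + p - 5) / {5(p - 1)}."""
    i = np.arange(1, p + 1)
    d = (4 * i + p - 5) / (5 * (p - 1))
    root = np.sqrt(d)
    return root[:, None] * model1(p) * root[None, :]

theta1 = model1(30)
theta2 = model2(30, np.random.default_rng(0))
theta3 = model3(30)
```

In Model 3 the diagonal increases linearly from 1/5 at $i = 1$ to 1 at $i = p$, which is what makes the diagonal of the target unequal.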

For each model, we generate a training sample of size 100 from a multivariate normal
distribution with mean zero and covariance matrix $\Sigma = \Theta^{-1}$ and an
independent sample of size 100 from the same distribution for validating the tuning
parameter $\lambda$ for the graphical Lasso and CLIME. The GLasso and CLIME estimators
are computed based on training data with various $\lambda$'s and we choose $\lambda$ by
minimizing the likelihood loss
$\{\mathrm{trace}(\bar\Sigma\hat\Theta) - \log\det(\hat\Theta)\}$ on the validation
sample. The proposed scaled Lasso estimator is computed based on the training sample
alone with the penalty level $\lambda_0 = \{(\log p)/n\}^{1/2}$. We consider 6
different dimensions $p = 30, 60, 90, 150, 300, 1000$ and replicate 100 times for each
case. The CLIME estimators for $p = 300$ and $p = 1000$ are not computed due to the
computational costs.
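The validation criterion and the fixed penalty level described above can be written down in a few lines. The sketch below is illustrative only: the function names are hypothetical, and returning infinity for a candidate that is not positive definite is an added convention, not something stated in the text.

```python
import numpy as np

def likelihood_loss(S_valid, theta):
    """Likelihood loss trace(S * Theta) - log det(Theta) on a validation covariance."""
    sign, logdet = np.linalg.slogdet(theta)
    if sign <= 0:
        # Assumption: candidates with non-positive determinant are never selected.
        return np.inf
    return np.trace(S_valid @ theta) - logdet

def select_by_validation(candidates, S_valid):
    """Index of the candidate estimate with the smallest validation loss."""
    losses = [likelihood_loss(S_valid, theta) for theta in candidates]
    return int(np.argmin(losses))

def scaled_lasso_penalty_level(n, p):
    """Penalty level lambda_0 = {(log p) / n}^{1/2} used for the scaled Lasso."""
    return np.sqrt(np.log(p) / n)

lam0 = scaled_lasso_penalty_level(n=100, p=30)
```

Only GLasso and CLIME need the validation sample; the scaled Lasso uses `lam0` directly.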

Table 1 presents the mean and standard deviation of estimation errors based on 100
replications. The estimation error is measured by several matrix norms: spectrum norm,
matrix $\ell_1$ norm and Frobenius norm. We can see that the scaled Lasso estimator,
labelled as SLasso, outperforms the graphical Lasso (GLasso) in all cases except the
Frobenius norm in Model 2 with $p \le 90$, while it has a comparable performance with
the CLIME.

Table 1: Estimation errors under various matrix norms of scaled Lasso, GLasso and 
CLIME for three models. 



Model 1
p     Spectrum norm                          Matrix $\ell_1$ norm                   Frobenius norm
      SLasso       GLasso       CLIME        SLasso       GLasso       CLIME        SLasso       GLasso       CLIME
30    2.41(0.08)   2.49(0.14)   2.29(0.21)   2.93(0.11)   3.09(0.11)   2.92(0.17)   4.09(0.12)   4.24(0.26)   3.80(0.36)
60    2.61(0.05)   2.94(0.05)   2.68(0.10)   3.10(0.09)   3.55(0.07)   3.27(0.09)   6.16(0.10)   7.15(0.15)   6.32(0.28)
90    2.67(0.05)   3.07(0.03)   2.87(0.09)   3.19(0.08)   3.72(0.06)   3.42(0.07)   7.73(0.11)   9.25(0.12)   8.42(0.31)
150   2.74(0.04)   3.19(0.02)   3.05(0.04)   3.28(0.08)   3.88(0.06)   3.55(0.06)   10.22(0.13)  12.55(0.09)  11.68(0.20)
300   2.80(0.03)   3.29(0.01)   NA           3.38(0.07)   4.06(0.05)   NA           14.77(0.11)  18.44(0.09)  NA
1000  2.87(0.03)   3.39(0.00)   NA           3.52(0.06)   4.44(0.07)   NA           27.59(0.12)  35.11(0.06)  NA

Model 2
p     Spectrum norm                          Matrix $\ell_1$ norm                   Frobenius norm
      SLasso       GLasso       CLIME        SLasso       GLasso       CLIME        SLasso       GLasso       CLIME
30    0.75(0.08)   0.82(0.07)   0.81(0.09)   1.32(0.16)   1.49(0.15)   1.45(0.18)   1.90(0.10)   1.84(0.09)   1.87(0.11)
60    1.07(0.05)   1.15(0.06)   1.19(0.08)   1.97(0.16)   2.21(0.12)   2.20(0.23)   3.31(0.08)   3.18(0.13)   3.42(0.09)
90    1.49(0.04)   1.54(0.05)   1.61(0.04)   2.63(0.16)   2.89(0.16)   2.90(0.17)   4.50(0.06)   4.40(0.11)   4.65(0.08)
150   1.98(0.03)   2.02(0.05)   2.06(0.03)   3.31(0.17)   3.60(0.15)   3.65(0.19)   6.02(0.05)   6.19(0.16)   6.33(0.08)
300   2.85(0.02)   2.89(0.02)   NA           4.50(0.14)   4.92(0.17)   NA           9.35(0.05)   9.79(0.05)   NA
1000  5.35(0.01)   5.52(0.01)   NA           7.30(0.21)   7.98(0.15)   NA           18.34(0.05)  20.81(0.02)  NA

Model 3
p     Spectrum norm                          Matrix $\ell_1$ norm                   Frobenius norm
      SLasso       GLasso       CLIME        SLasso       GLasso       CLIME        SLasso       GLasso       CLIME
30    1.75(0.10)   2.08(0.10)   1.63(0.19)   2.24(0.14)   2.59(0.10)   2.17(0.20)   2.52(0.10)   2.91(0.16)   2.37(0.25)
60    2.09(0.08)   2.63(0.04)   2.10(0.10)   2.58(0.13)   3.10(0.05)   2.65(0.14)   3.81(0.09)   4.84(0.08)   3.98(0.13)
90    2.24(0.07)   2.84(0.03)   2.38(0.18)   2.72(0.12)   3.30(0.06)   2.91(0.12)   4.79(0.08)   6.25(0.08)   5.37(0.37)
150   2.40(0.06)   3.06(0.02)   2.76(0.05)   2.89(0.11)   3.45(0.04)   3.18(0.09)   6.35(0.09)   8.43(0.07)   7.75(0.08)
300   2.54(0.05)   3.26(0.01)   NA           3.05(0.10)   3.58(0.03)   NA           9.20(0.09)   12.41(0.04)  NA
1000  2.68(0.05)   3.47(0.01)   NA           3.26(0.09)   3.73(0.03)   NA           17.2(0.09)   23.55(0.02)  NA

5 More results 

5.1 Oracle inequalities for scaled Lasso estimator 

In the proof of the theoretical results for the proposed estimator, we use oracle
inequalities for the estimation error associated with a linear model without
normalizing the predictors. In this section, we describe this aspect of our results.
Consider a linear model as follows,

$$y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n).$$

Let $\bar\Sigma = X'X/n$, $D = \mathrm{diag}\,\bar\Sigma$, $\tilde X = XD^{-1/2}$ and
$\tilde\Sigma = D^{-1/2}\bar\Sigma D^{-1/2}$. In order to penalize the coefficients on
the same scale, we use a weighted $\ell_1$ norm of the coefficients as the penalty
function. Consider the estimator

$$\{\hat\beta, \hat\sigma\} = \arg\min_{b,\sigma}\Big\{\frac{\|y - Xb\|_2^2}{2n\sigma} + \frac{\sigma}{2} + \lambda_0\|D^{1/2}b\|_1\Big\}. \qquad (11)$$

This is actually the scaled Lasso as we use in matrix estimation in Section 2. It is
equivalent to the estimation based on normalized predictors:

$$\{\hat\alpha, \hat\sigma\} = \arg\min_{a,\sigma}\Big\{\frac{\|y - \tilde Xa\|_2^2}{2n\sigma} + \frac{\sigma}{2} + \lambda_0\|a\|_1\Big\} \qquad (12)$$

with $\hat\beta = D^{-1/2}\hat\alpha$.
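The equivalence between (11) and (12) amounts to the change of variables $a = D^{1/2}b$, under which the two objectives take the same value. A minimal numerical check, with hypothetical function names and made-up data, is sketched below.

```python
import numpy as np

def objective_weighted(y, X, b, sigma, lam0):
    """Objective in (11): weighted l1 penalty on the original predictors."""
    n = X.shape[0]
    d = np.sum(X ** 2, axis=0) / n            # diagonal of X'X/n
    rss = np.sum((y - X @ b) ** 2)
    return rss / (2 * n * sigma) + sigma / 2 + lam0 * np.sum(np.sqrt(d) * np.abs(b))

def objective_normalized(y, X, a, sigma, lam0):
    """Objective in (12): plain l1 penalty on the normalized predictors."""
    n = X.shape[0]
    d = np.sum(X ** 2, axis=0) / n
    X_tilde = X / np.sqrt(d)                  # columns rescaled to unit mean square
    rss = np.sum((y - X_tilde @ a) ** 2)
    return rss / (2 * n * sigma) + sigma / 2 + lam0 * np.sum(np.abs(a))

# Made-up data with deliberately unequal column scales.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4)) * np.array([1.0, 2.0, 0.5, 3.0])
y = rng.standard_normal(50)
b = rng.standard_normal(4)
a = np.sqrt(np.sum(X ** 2, axis=0) / 50) * b   # a = D^{1/2} b

gap = objective_weighted(y, X, b, 1.3, 0.2) - objective_normalized(y, X, a, 1.3, 0.2)
print(abs(gap) < 1e-10)
```

Because the objectives agree for every pair $(b, \sigma)$ and $(D^{1/2}b, \sigma)$, their minimizers are related by $\hat\beta = D^{-1/2}\hat\alpha$.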

The following theorem gives the oracle inequalities for the estimation of regression
coefficients and noise level.

Theorem 3 Let $\{\hat\alpha, \hat\beta, \hat\sigma\}$ be as in (11) and (12), and
$\sigma^* = \|y - X\beta\|_2/n^{1/2}$, $S = \{k : \beta_k \ne 0\}$,
$z^* = \|\tilde X'(y - X\beta)/n\|_\infty/\sigma^*$ and $\xi > 1$.
(i) In the event $z^* \le (1 + \tau_+)^{-1/2}\lambda_0(\xi - 1)/(\xi + 1)$,

ITTT^y ^137^' l'"-"l'^^ 2eAo(l-r-)V2 - (13) 

ii3-/3ik< ^^^y: (14) 



2eAo(l-r-)i/2niinfcD 



1/2 ■ 
fcfc 



where T- = 0i(OAg|5|/^C/Fi(e, ^; S) and r+ = MO^llS]/ SCIFi{^, S;i:) with 

constants ^ > 1, MO = 4^7(1 + 0' and MO = m ' 1)/(1 + 0'- 

(ii) Let $\lambda_0 \ge \{(2/n)\log(p/\epsilon)\}^{1/2}(\xi + 1)/\{(\xi - 1)(1 - \tau_-)\}$.
For $n - 2 > \log(p/\epsilon) \to \infty$,

$$P\{z^* \le (1 - \tau_-)\lambda_0(\xi - 1)/(\xi + 1)\} \ge 1 - (1 + o(1))\epsilon/\sqrt{\pi\log(p/\epsilon)}.$$

Theorem 3 is an immediate extension of the oracle inequalities for the scaled Lasso in
Sun and Zhang (2011). With an extra condition that $x_k'x_k/n$ ($k = 1, \ldots, p$) are
uniformly bounded away from zero, the estimators have the same convergence rate as
that for a regression model with normalized predictors. The error rates (10) follow
from Theorem 3 and are used to prove the convergence rate of matrix estimation.

5.2 Error bounds for inverse correlation matrix 

Given the data matrix $\bar\Sigma = X'X/n$, we are also interested in estimating the
inverse correlation matrix

$$\Omega^* = \{D^{-1/2}\Sigma^*D^{-1/2}\}^{-1} = D^{1/2}\Theta^*D^{1/2},$$

where $D = \mathrm{diag}(\Sigma^*)$. When constructing the matrix estimator
$\hat\Theta$, we solve the linear regression problem with normalized predictors. If the
estimator $\hat\beta$ in (6) is replaced by $\hat\alpha$ as in (12), the resulting
matrix is an estimator for $D^{1/2}(\Sigma^*)^{-1}$. Thus, the inverse correlation
matrix is estimated by

$$\tilde\Omega = -\hat\alpha\,\mathrm{diag}(\hat\sigma_j^{-2}\bar\Sigma_{jj}^{1/2},\ j = 1, \ldots, p),$$
$$\hat\Omega = \arg\min_{M : M^T = M}\|M - \tilde\Omega\|_1. \qquad (15)$$

The error bounds of this estimator are well established as follows. 

Theorem 4 Let $\epsilon \in (0, 1/4)$, $\lambda_0 = A\{(2/n)\log(p^2/\epsilon)\}^{1/2}$
with $A > 1$ and $d\lambda_0 \to 0$. Suppose that
$\{d(\log p)/n\}^{1/2}\rho^*/\rho_* \le a$ for a small fixed $a$. Then with probability
greater than $1 - 4\epsilon$,

11^2-^^*112 < = ||i7-ri*||oo 

< CiA2rf||n||i||0||2 + C2Ao||0||i + C3Aoc/||n||2maxl]]f (16) 

where $C_1$, $C_2$ and $C_3$ are constants depending on $\{A, a\}$ only.

Theorem 4 shows that the error bounds are also of the order $d\lambda_0$ under proper
conditions. The crucial condition here is still the boundedness of the spectrum norm
of the unknown matrix.



6 Proofs 

In this section, we provide the proofs of Theorem 3, Theorem 2 and Theorem 4.
Theorem 1 is a brief version of Theorem 2, so we omit the proof.

Proof of Theorem 3. The inequalities (13) are parallel to Theorem 2 in Sun and Zhang
(2011). The only difference is that here we use the $\ell_1$ bound under the condition
of the sign-restricted cone invertibility factor (SCIF). Since $\beta = D^{-1/2}\alpha$,
(14) follows from the second inequality in (13).

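Proof of Theorem 2. Let $\xi > (A + 1)/(A - 1)$,
$(\sigma_j^*)^2 = \beta_{*,j}'\bar\Sigma\beta_{*,j}$,
$z_{(j),k} = \bar\Sigma_{kk}^{-1/2}|\bar\Sigma_{k,*}\beta_{*,j}|/\sigma_j^*$ and
$z^*_{(j)} = \max_{k \ne j} z_{(j),k}$. By Theorem 3, in the event
$z^*_{(j)} \le (1 + \tau_j^+)^{-1/2}\lambda_0(\xi - 1)/(\xi + 1)$,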

TT^^l^J -r^' l^^-l^^'^--^^^^l^2eAo(i-.,7,)V2' (^^) 

where r,". = MO^l\Sj\/ SCIF,{^, S,; E_,) and r+ = MO^l\S,\/ SCIF,{^, S,; S_,). 
We first derive some probabilistic bounds for some useful quantities. Since
$\bar\Sigma = X'X/n$ and the rows of $X$ follow a multivariate normal distribution with
covariance matrix $\Sigma^*$, we have
$\sigma_j^* = \|x_j - X_{-j}\beta_{-j,j}\|/\sqrt{n}$ and
$z_{(j),k} = \bar\Sigma_{kk}^{-1/2}|x_k'(x_j - X_{-j}\beta_{-j,j})|/(n\sigma_j^*)$.
Thus, $n(\sigma_j^*/\sigma_j)^2$ follows a $\chi^2$ distribution with $n$ degrees of
freedom and thus

$$P\{|(\sigma_j^*/\sigma_j)^2 - 1| > \sqrt{(8/n)\log(2p/\epsilon)}\} \le \epsilon/p. \qquad (18)$$

Also, we have that $z_{(j),k}/\{(1 - z_{(j),k}^2)/(n - 1)\}^{1/2}$ follows a
t-distribution with $n - 1$ degrees of freedom. We use Lemma 1 in Sun and Zhang (2011)
with $m = n - 1$ and $t^2 = \log(p^2/\epsilon) > 2$ to obtain

$$P\{|z_{(j),k}| > \sqrt{2\log(p^2/\epsilon)/n}\} \le (1 + \epsilon_{n-1})(\epsilon/p^2)/\sqrt{\pi\log(p^2/\epsilon)}.$$

Thus,

$$P\{\max_{j,k}|z_{(j),k}| > \sqrt{2\log(p^2/\epsilon)/n}\} \le \epsilon, \qquad (19)$$

i.e. the events $z^*_{(j)} \le (1 + \tau_j^+)^{-1/2}\lambda_0(\xi - 1)/(\xi + 1)$
($j = 1, \ldots, p$) occur with probability greater than $1 - \epsilon$. Since
$n\bar\Sigma_{kk} \sim \Sigma^*_{kk}\chi^2_n$, we have

$$P\{|\bar\Sigma_{kk}/\Sigma^*_{kk} - 1| > \sqrt{(8/n)\log(2p/\epsilon)}\} \le \epsilon/p. \qquad (20)$$

So there exists a small $C$ such that $\max_k|\bar\Sigma_{kk}/\Sigma^*_{kk} - 1| \le C$
holds for all $k$ with probability greater than $1 - \epsilon$.

Now we need to bound $\mathrm{SCIF}_1(\xi, S_j; \bar\Sigma_{-j})$, for all $j$, with
probability greater than $1 - \epsilon$ under the given conditions, where
$S_j = \{i \ne j : \beta_{ij} \ne 0\}$. Let $Z = X(\mathrm{diag}\,\Sigma^*)^{-1/2}$. We
discuss the bounds for $\mathrm{SCIF}_1$ within the event
$\max_k|\bar\Sigma_{kk}/\Sigma^*_{kk} - 1| \le C$.

For {\A\, \B\, \\u\\, \\v\\r) = ([a], [6], 1, 1) with A n 5 = 0, we define 
5^ = S^iX) =max{± (\\X'^XAu/n\\ - l) |, = ^(^^(X) = niax v'X'^Xsu/n. 

A,u K \ J J A,B,u,v 

For any subset T C { 1 , . . . , p} , we have 

C>€(^t), S^>S^{Xr), C<(l + 5+)^/^(l + 5+)^/^<l + ev.- (21) 
By Proposition 2(i) in Zhang and Huang (2008), we have 

P{(1 - c)V. < 1 - S-{Z) < 1 + d^{Z) < (1 + c) V} > 1 - e, (22) 

where c = ^JrnJn + ^J {2m/n) log(2p/e)(l + o(l)). We also have 

1 + 5t(x) < max(S*,/S,,)(l + S^iZ)) = (1 + - 0, 
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1 - S-{X) > min(E*,/E,fc)(l - 5,"(Z)) = (1 - 5,"(Z))/(1 + C). (23) 

Let $k_j = |S_j|$. It follows from the shifting inequality in Ye and Zhang (2010) with
$\ell \ge d$ that



sciF,{^,Sr,i:^,) > ^(i-5-^,(x_,)-ey^^l^;,^,^,(x„,)) 

> Y^{i-^4-.(^)-ey^(i + 4(^))} 



> 



1 f l-6-,{Z) d l + 6t,{Z) 

+ 1 + c V4£ 1-c ^ 



The second and the third inequalities follow from (21) and (23), respectively. Let
$m = 4\ell$ in (22) with $\ell = d(\xi\rho^*/\rho_*)^2 \ge d$. Then

Under the condition $(\rho^*/\rho_*)\{(d/n)\log p\}^{1/2} \le a$ for a small fixed $a$,
$c$ is also very small. Thus, with probability greater than $1 - \epsilon$,
$\mathrm{SCIF}_1(\xi, S_j; \bar\Sigma_{-j})$ are bounded from below by $C\rho_*$ for all
$j$, where $C$ is a constant depending only on $\{\xi, a\}$.

Now we are ready to bound the $\ell_1$ norm of the columns of $\tilde\Theta - \Theta^*$
by (17), (18), (19), (20) and the uniform bound for $\mathrm{SCIF}_1$. The following
inequalities hold with probability greater than $1 - 4\epsilon$:



|0.,-0;-||i < |0,,-0*,| + ||0-,,,-01,,, 



1 



cr 



-2 



cr 



ay \a,J a. 



< 110 



r^' \ 2 1 c I r^i \ I c 



P* (Tj(minfcSfcfc)^^V 



The first two inequalities just use some simple algebra, while the last one puts (17),
(18), (19), (20) and the uniform bound for $\mathrm{SCIF}_1$ together. The constants
$C_1'$ and $C_2'$ only depend on $\{A, a\}$. Therefore, the $\ell_1$ error of the matrix
estimator $\tilde\Theta$ is bounded by

$$\|\tilde\Theta - \Theta^*\|_1 \le C_1'\lambda_0^2 d\,\|\Theta^*\|_1\rho_*^{-1} + C_2'\lambda_0 d\max_k\Theta^*_{kk}\,\rho_*^{-1}.$$

Then the upper bound for $\|\hat\Theta - \Theta^*\|_1$ follows from the triangle
inequality and the definition of $\hat\Theta$, since
$\|\hat\Theta - \tilde\Theta\|_1 \le \|\Theta^* - \tilde\Theta\|_1$.
For any matrix $M$ and vectors $u$, $v$, we have

$$u'Mv = \sum_{i,j}M_{ij}u_iv_j \le \Big\{\sum_{i,j}|M_{ij}|u_i^2\sum_{i,j}|M_{ij}|v_j^2\Big\}^{1/2} \le (\|M\|_\infty \cdot \|M\|_1)^{1/2}\|u\| \cdot \|v\|.$$

So $\|M\|_2^2 \le \|M\|_\infty \cdot \|M\|_1$. For the symmetric matrix
$\hat\Theta - \Theta^*$, we have

$$\|\hat\Theta - \Theta^*\|_2 \le \|\hat\Theta - \Theta^*\|_\infty \le \|\hat\Theta - \Theta^*\|_1.$$

The desired error bounds based on the spectrum norm then follow. □

Proof of Theorem 4. The inequalities (18), (19), (20) and the uniform bound for
$\mathrm{SCIF}_1$ still hold. We first bound the $\ell_1$ norm of one column of
$\tilde\Omega - \Omega^*$ as follows



< l^j ^jj-^jj^j ^jj I + II ^^-j 111- _2ni/2 ~ ^ 

< C[n,j\ld/P* + C'^\\^.M>^ld/p* + C^V(logp)/n) + C^l^^f Aorf/p: 



2v^l/2||^ II 



where the constants only depend on $\{A, a\}$. Thus, taking the maximum for both sides
gives the error bounds for the matrix estimator $\tilde\Omega$ under the $\ell_1$ matrix
norm. The rest of the proof for the error bounds of $\hat\Omega - \Omega^*$ under
various norms is the same as that in the proof of Theorem 2.